Investigation of the underlying physics or biology from empirical data requires a quantifiable notion of similarity: when do two observed data sets indicate nearly identical generating processes, and when do they not? The discriminating characteristics to look for in data are often determined by heuristics designed by experts, e.g., distinct shapes of "folded" light curves may be used as "features" to classify variable stars, while determination of pathological brain states might require a Fourier analysis of brainwave activity. Finding good features is non-trivial. Here, we propose a universal solution to this problem: we delineate a principle for quantifying similarity between sources of arbitrary data streams, without a priori knowledge, features, or training. We uncover an algebraic structure on a space of symbolic models for quantized data, and show that such stochastic generators may be added and uniquely inverted, and that a model and its inverse always sum to the generator of flat white noise. Therefore, every data stream has an anti-stream: data generated by the inverse model. Similarity between two streams, then, is the degree to which one, when summed to the other's anti-stream, mutually annihilates all statistical structure to noise. We call this data smashing. We present diverse applications, including disambiguation of brainwaves pertaining to epileptic seizures, detection of anomalous cardiac rhythms, and classification of astronomical objects from raw photometry. In our examples, the data smashing principle, without access to any domain knowledge, meets or exceeds the performance of specialized algorithms tuned by domain experts.
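The summation-and-annihilation idea can be illustrated on binary symbol streams. The sketch below is a simplified assumption for illustration only, not the paper's construction: `sum_streams` keeps positions where two streams agree, and `deviation_from_noise` is a crude flatness measure (maximum deviation of symbol frequencies from uniform). The paper's actual operations act on probabilistic automata inferred from quantized data, not on raw symbol frequencies.

```python
import random

def sum_streams(s1, s2):
    # Toy "stream summation" (an assumption for illustration):
    # keep a symbol only at positions where the two streams agree.
    return [a for a, b in zip(s1, s2) if a == b]

def deviation_from_noise(stream, alphabet=(0, 1)):
    # Crude flatness measure: maximum deviation of symbol frequencies
    # from the uniform distribution; 0.0 means flat white noise.
    n = len(stream)
    if n == 0:
        return 1.0
    return max(abs(stream.count(a) / n - 1.0 / len(alphabet)) for a in alphabet)

random.seed(0)
# A biased stream and an oppositely biased stream standing in for its
# "anti-stream": summing them leaves something close to flat noise,
# while summing two streams with the same bias does not.
biased = [1 if random.random() < 0.8 else 0 for _ in range(20000)]
anti   = [1 if random.random() < 0.2 else 0 for _ in range(20000)]

print(deviation_from_noise(sum_streams(biased, anti)))          # near 0: structure annihilated
print(deviation_from_noise(sum_streams(biased, biased[::-1])))  # large: bias survives
```

The agreement-selection rule makes the first result intuitive: positions where both streams emit 1 (probability 0.8 × 0.2) are exactly as likely as positions where both emit 0 (0.2 × 0.8), so the surviving symbols are uniform.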